CopulaHiC is R package for differential analysis of Hi-C data. It takes advantage of significant correlations of main diagonals between different Hi-C data sets (cell lines, experiments, etc.) - usually the first 200 to 300 diagonals from the main diagonal are considered. CopulaHiC uses copulas to model these dependancies and then quantifies deviations from the model in a probabilistic way.
This vignette contains the description of our method:
If you want to spare yourself extra reading and jump straight to analysis part, please see the quick start chapter. It contains a step-by-step introduction to Hi-C differential analysis with our package using an example Hi-C data set.
Load the package:
library("CopulaHiC")
To compare Hi-C experiments, you will need 2 files with Hi-C matrices. In this tutorial, we will use Hi-C data sets of human IMR90 and human MSC cells lines from GSE63525 and GSE52457 studies respectively available as sample data in CopulaHiC package. For simplicity we will only use chromosome 18. First, we read the data:
mtx.fname.imr90 <- system.file("extdata", "IMR90-MboI-1_40kb-raw.npz",
package = "CopulaHiC", mustWork = TRUE)
mtx.fname.msc <- system.file("extdata", "MSC-HindIII-1_40kb-raw.npz",
package = "CopulaHiC", mustWork = TRUE)
hic.comparator <- HiCcomparator(mtx.fname.imr90, mtx.fname.msc, mtx.names = c("18"), do.pca = TRUE)
Illustrate Hi-C contact maps with A/B compartments (feel free to adjust margins as you want):
dense.imr90 <- sparse2dense(hic.comparator$maps1[["18"]][c("i","j","val")],
N = hic.comparator$maps.dims[["18"]][1,"n.rows"])
dense.msc <- sparse2dense(hic.comparator$maps2[["18"]][c("i","j","val")],
N = hic.comparator$maps.dims[["18"]][2,"n.rows"])
plot.margins <- 0.1
n.plots <- 4
par(mfrow = c(2,2), mai = rep(plot.margins, n.plots))
plot_contact_map(dense.imr90)
plot_contact_map(dense.msc)
plot_pc_vector(hic.comparator$pc1.maps1[["18"]])
plot_pc_vector(hic.comparator$pc1.maps2[["18"]])
Figure 1.1: IMR90 and MSC contact maps with A/B compartments of human chromosome 18.
Determine TADs for both Hi-C maps and illustrate them:
tads.imr90 <- map2tads(dense.imr90, resolution = 40*1000, window.bp = 400*1000, delta.bp = 80*1000)
tads.msc <- map2tads(dense.msc, resolution = 40*1000, window.bp = 400*1000, delta.bp = 80*1000)
plot_with_inset(list(dense.imr90), c(600,900), c(600,900), args.regions = list(tads.imr90))
plot_with_inset(list(dense.msc), c(600,900), c(600,900), args.regions = list(tads.msc))
Figure 1.2: IMR90 contact map of human chromosome 18 with TADs.
Figure 1.3: MSC contact map of human chromosome 18 with TADs.
Compare coverage and decays for IMR90 and MSC, for decays use log10 scales for both X and Y axis:
library("ggplot2")
coverage <- dominating_signal(hic.comparator, which.signal = "coverage")
ggplot(coverage, aes(x = i, y = sum.contacts, color = dataset)) +
geom_point(size = 0.5) +
geom_smooth(alpha = 0.5) +
facet_wrap(~ name, ncol = 1, scales = "free") +
theme(legend.position = "bottom")
decay <- dominating_signal(hic.comparator, which.signal = "decay")
ggplot(decay[decay$diagonal != 0,], aes(x = diagonal, y = mean.contacts, color = dataset)) +
geom_point(size = 0.5) +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~ name, ncol = 1, scales = "free") +
theme(legend.position = "bottom")
Figure 1.4: Coverage comparison between IMR90 and MSC Hi-C data of human chromosome 18.
Figure 1.5: Decay comparison between IMR90 and MSC Hi-C data of human chromosome 18.
First, calculate the fold change and differential maps IMR90 / MSC and IMR90 - MSC respectively. Then, illustrate results with zoomin of 600 - 900 bins region. By default, when plotting fold change maps log10 scale is used. When plotting the differential map, one should indicate square root transformation of data for better visibility (see examples below):
merged <- merge(hic.comparator)
merged.18 <- merged[["18"]]
merged.18$fc <- merged.18$val.x / merged.18$val.y
merged.18$difference <- merged.18$val.x - merged.18$val.y
dense.fc <- sparse2dense(merged.18[c("i","j","fc")], N = hic.comparator$maps.dims[["18"]][1,"n.rows"])
dense.diff <- sparse2dense(merged.18[c("i","j","difference")],
N = hic.comparator$maps.dims[["18"]][1,"n.rows"])
plot_with_inset(list(dense.fc, fc = TRUE, colors.pal = c("blue","white","red")), c(600,900), c(600,900), which.map = "contact.map")
plot_with_inset(list(dense.diff, breaks = 100, sqrt.transform = TRUE), c(600,900), c(600,900), which.map = "diff.map")
Figure 1.6: Fold Change map of IMR90 / MSC of human chromosome 18.
Figure 1.7: Differential map of IMR90 - MSC of human chromosome 18.
Check the correlations between corresponding diagonals of IMR90 and MSC:
library("reshape2")
decay.cors <- decay_correlation(hic.comparator)
# wide to long
decay.cors.long <- reshape2::melt(decay.cors[c("name","diagonal","pcc","rho","tau")],
id.vars = c("name","diagonal"), variable.name = "correlation",
value.name = "coefficient")
# remove 0 diagonal (as it is non informative anyways) and illustrate results
ggplot(decay.cors.long[decay.cors.long$diagonal != 0,],
aes(x = diagonal, y = coefficient, color = correlation)) +
geom_point(size = 0.7) +
scale_x_continuous(breaks = c(1,seq(0, max(decay.cors.long$diagonal), by = 250)[-1])) +
facet_wrap(~ name, ncol = 1, scales = "free") +
theme(legend.position = "bottom")
dc <- decay.cors[c("name","diagonal","pearson.pval","spearman.pval","kendall.pval")]
colnames(dc) <- c("name","diagonal","pearson","spearman","kendall")
decay.sig.long <- reshape2::melt(dc, id.vars = c("name","diagonal"),
variable.name = "correlation", value.name = "pval")
decay.sig.long$neg.log.pval <- -log10(decay.sig.long$pval)
ggplot(decay.sig.long[decay.sig.long$diagonal != 0,],
aes(x = diagonal, y = neg.log.pval, color = correlation)) +
geom_point(size = 0.7) +
geom_hline(yintercept = -log10(0.01)) +
annotate("text", 1800, -log10(0.01), vjust = -1, label = "pval = 0.01") +
scale_x_continuous(breaks = c(1,seq(0, max(decay.sig.long$diagonal), by = 250)[-1])) +
ylab("-log10(pval)") +
facet_wrap(~ name, ncol = 1, scales = "free") +
theme(legend.position = "bottom")
Figure 1.8: Correlations between diagonals of IMR90 and MSC Hi-C data of human chromosome 18.
Figure 1.9: Significances of correlations between diagonals of IMR90 and MSC Hi-C data of human chromosome 18.
Construct Hi-C copula models (if you are running this on a cluster, you can increase the number of cores - it will speed up the model construction significantly):
hic.copula <- HiCcopula(hic.comparator, diagonals = 1:240, n.cores = 1)
Plot marginals gamma fit for diagonal 10 and copula fit for diagonals, say 1, 10, 50 and 200:
model.18.1 <- hic.copula$model[["18"]][["1"]]
model.18.10 <- hic.copula$model[["18"]][["10"]]
model.18.50 <- hic.copula$model[["18"]][["50"]]
model.18.200 <- hic.copula$model[["18"]][["200"]]
library("fitdistrplus")
library("VineCopula")
library("gridExtra")
plot(model.18.10$marginal.x)
plot(model.18.10$marginal.y)
c1 <- plot(model.18.1$bf.copula, main = "diagonal 1")
c2 <- plot(model.18.10$bf.copula, main = "diagonal 10")
c3 <- plot(model.18.50$bf.copula, main = "diagonal 50")
c4 <- plot(model.18.200$bf.copula, main = "diagonal 200")
grid.arrange(c1,c2,c3,c4, ncol=2)
Figure 1.10: Gamma fit of marginal distribution X (IMR90) of diagonal 10.
Figure 1.11: Gamma fit of marginal distribution Y (MSC) of diagonal 10.
Figure 1.12: Copula fit of U (IMR90) vs V (MSC) for diagonals 1,10,50 and 200.
When the background model is constructed differential map can be computed (and visualised):
diffmap <- hicdiff(hic.copula)
diffmap.dense <- hicdiff2mtx(diffmap, hic.copula$maps.dims)
# plot diffmap
diffmap.18.dense <- diffmap.dense[["18"]]
plot_with_inset(list(diffmap.18.dense, breaks = 100), c(1300,1700), c(1300,1700), which.map = "diff.map")
Figure 1.13: Differential/Significance map of human chromosome 18 IMR90 vs MSC cell lines.
Finally groups of cells with significant depletion/enrichment can be identified (with some precision) as rectangle-like regions containing connected components:
lrdi <- differential_interactions(hic.copula)
# get rectangle like regions
regions <- lrdi[["18"]]$interacting.regions
# plot them --> only those which contain at least 5 cells inside connected component
plot_with_inset(list(diffmap.18.dense, breaks = 100), c(1300,1700), c(1300,1700), which.map = "diff.map",
args.regions = list(regions[regions$n.cells >= 5, 2:6], pal.colors = c("blue","red")))
plot_with_inset(list(diffmap.18.dense, breaks = 100), c(900,1200), c(900,1200), which.map = "diff.map",
args.regions = list(regions[regions$n.cells >= 5, 2:6], pal.colors = c("blue","red")))
Figure 1.14: Detection of differential long range interactions in IMR90 vs MSC map. Bilinear model fitness to depleted significances is used in order to determine threshold for depleted long range interactions.
Figure 1.15: Detection of differential long range interactions in IMR90 vs MSC map. Bilinear model fitness to enriched significances is used in order to determine threshold for enriched long range interactions.
Figure 1.16: Differential/Significance map of human chromosome 18, IMR90 vs MSC cell lines with long range differential interactions (zoomin 1).
Figure 1.17: Differential/Significance map of human chromosome 18, IMR90 vs MSC cell lines with long range differential interactions (zoomin 2).
Figure 2.1: Contact map of human chromosome 18 IMR90 cell line in 40kb resolution.
Hi-C is a complex protocol with many sources of bias (Yaffe and Tanay 2011). This makes the analysis of this type of data very challenging. Therefore, any biological study utilizing Hi-C data performed without taking these biases into account will lead to severely biased results.
Here we consider a problem of comparing 2 Hi-C datasets. Depending on experimental design it may be, for example, different cell lines or different experimental conditions. This sort of studies apper frequently in literature making the problem of countering biases important (Le Dily et al. 2014; Dixon et al. 2015; Lun and Smyth 2015). We present key issues arising in this type of assays and suggest a model, which have potential to improve the reliability of Hi-C differential analysis.
Figure 3.1: Coverages (sum of contacts per region) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines.
Figure 3.2: Decays (mean of contacts per diagonal) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines.
Figure 3.3: Fold Change map of IMR90 / MSC of human chromosome 19.
Figure 3.4: Coverage (sum of contacts per region) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines after Iterative Correction.
Figure 3.5: Decays (mean of contacts per diagonal) of human chromosome 19 from 2 Hi-C experiments: IMR90 and MSC cell lines after Iterative Correction.
Figure 3.6: Fold Change map of IMR90 / MSC of human chromosome 19 with both maps iteratively corrected.
Figure 3.7: Comparison of contact decays (mean of contacts per diagonal) of human chromosomes 18 and 19 from 2 Hi-C experiments: IMR90 and MSC cell lines between HiCNorm and Iterative Correction normalization methods.
A possible workaround could be the normalization avoiding method, which tries to model similarities in coverage and contact decay between both Hi-C experiments simultaneously and then seeks for deviations from the model.
Figure 4.1: Correlations between corresponding diagonals (i.e. 1 vs 1, 2 vs 2, etc) of IMR90 replicate 1 vs replicate 2 Hi-C contact map of human chromosome 19.
Figure 4.2: Significances of the above correlations.
Figure 4.3: Correlations between corresponding diagonals (i.e. 1 vs 1, 2 vs 2, etc) of IMR90 vs MSC Hi-C contact map of human chromosome 19.
Figure 4.4: Significances of diagonals correlations.
The trends shown above are representative of many other cell types and chromosomes analyzed by us.
The above observation prompted us to model the dependence between Hi-C contact maps as the global underlying trend and then try to find significant deviations from the model as the local, relevant differences. The common way to model dependencies between datasets is with the use of copulas. Copulas are multivariate distributions with uniform marginals. They allow to separate the modelling of marginal distributions from the modelling of dependencies between random variables. Copulas have a long history of data modelling, for more details on the topic see for example (embrechts2001modelling; Trivedi, Zimmer, and others 2007). A great advantage of using copulas is that there exist a lot of rich and well documented libraries dedicated to copula modelling, which makes model selection quick and straightforward process (Hofert et al. 2017; Jun Yan 2007; Ivan Kojadinovic and Jun Yan 2010; Marius Hofert and Martin Mächler 2011; Schepsmeier et al. 2018).
Our package is build on top of the VineCopula library, which can easily handle model selection for bivariate distributions. The pipeline can be summarized with below panels.Figure 4.5: Input, a pair of Hi-C contact maps to be compared.
Figure 4.6: Model selection and p-value calculation.
Figure 4.7: Resulting significance map.
As can be seen on last panel our model has the abillity to identify even weak long range differential interactions present in the data.
Figure 5.1: Selection of significantly depleted/enriched regions.
Figure 5.2: Selection of significantly depleted regions, bilinear model fitting.
Figure 5.3: Selection of significantly enriched regions, bilinear model fitting.
In the last step a zero matrix with the same shape as initial significance matrix is constructed. Cells coordinates corresponding to that of significant ones are marked with 1. Resulting binary matrix is scanned for connected components. To perform a connected components search, the clump function from the raster package is used (Hijmans 2018). The algorithm outputs 2 elements:
Figure 5.4: Long range differential interactions with connected components containing at least 5 significant cells.
Figure 5.5: Long range differential interactions with connected components containing at least 10 significant cells.
Dixon, Jesse R, Inkyung Jung, Siddarth Selvaraj, Yin Shen, Jessica E Antosiewicz-Bourget, Ah Young Lee, Zhen Ye, et al. 2015. “Chromatin Architecture Reorganization During Stem Cell Differentiation.” Nature 518 (7539). Nature Publishing Group: 331.
Hijmans, Robert J. 2018. Raster: Geographic Data Analysis and Modeling. https://CRAN.R-project.org/package=raster.
Hofert, Marius, Ivan Kojadinovic, Martin Maechler, and Jun Yan. 2017. Copula: Multivariate Dependence with Copulas. https://CRAN.R-project.org/package=copula.
Hu, Ming, Ke Deng, Siddarth Selvaraj, Zhaohui Qin, Bing Ren, and Jun S Liu. 2012. “HiCNorm: Removing Biases in Hi-c Data via Poisson Regression.” Bioinformatics 28 (23). Oxford University Press: 3131–3.
Imakaev, Maxim, Geoffrey Fudenberg, Rachel Patton McCord, Natalia Naumova, Anton Goloborodko, Bryan R Lajoie, Job Dekker, and Leonid A Mirny. 2012. “Iterative Correction of Hi-c Data Reveals Hallmarks of Chromosome Organization.” Nature Methods 9 (10). Nature Publishing Group: 999.
Ivan Kojadinovic, and Jun Yan. 2010. “Modeling Multivariate Distributions with Continuous Margins Using the copula R Package.” Journal of Statistical Software 34 (9): 1–20. http://www.jstatsoft.org/v34/i09/.
Jun Yan. 2007. “Enjoy the Joy of Copulas: With a Package copula.” Journal of Statistical Software 21 (4): 1–21. http://www.jstatsoft.org/v21/i04/.
Lajoie, Bryan R, Job Dekker, and Noam Kaplan. 2015. “The Hitchhiker’s Guide to Hi-c Analysis: Practical Guidelines.” Methods 72. Elsevier: 65–75.
Le Dily, François, Davide Baù, Andy Pohl, Guillermo P Vicent, François Serra, Daniel Soronellas, Giancarlo Castellano, et al. 2014. “Distinct Structural Transitions of Chromatin Topological Domains Correlate with Coordinated Hormone-Induced Gene Regulation.” Genes & Development 28 (19). Cold Spring Harbor Lab: 2151–62.
Lieberman-Aiden, Erez, Nynke L Van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, et al. 2009. “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome.” Science 326 (5950). American Association for the Advancement of Science: 289–93.
Lun, Aaron TL, and Gordon K Smyth. 2015. “DiffHic: A Bioconductor Package to Detect Differential Genomic Interactions in Hi-c Data.” BMC Bioinformatics 16 (1). BioMed Central: 258.
Marius Hofert, and Martin Mächler. 2011. “Nested Archimedean Copulas Meet R: The nacopula Package.” Journal of Statistical Software 39 (9): 1–20. http://www.jstatsoft.org/v39/i09/.
Schepsmeier, Ulf, Jakob Stoeber, Eike Christian Brechmann, Benedikt Graeler, Thomas Nagler, and Tobias Erhardt. 2018. VineCopula: Statistical Inference of Vine Copulas. https://CRAN.R-project.org/package=VineCopula.
Trivedi, Pravin K, David M Zimmer, and others. 2007. “Copula Modeling: An Introduction for Practitioners.” Foundations and Trends in Econometrics 1 (1). Now Publishers, Inc.: 1–111.
Yaffe, Eitan, and Amos Tanay. 2011. “Probabilistic Modeling of Hi-c Contact Maps Eliminates Systematic Biases to Characterize Global Chromosomal Architecture.” Nature Genetics 43 (11). Nature Publishing Group: 1059.
Yang, Tao, Feipeng Zhang, Galip Gurkan Yardimci, Fan Song, Ross C Hardison, William Stafford Noble, Feng Yue, and Qunhua Li. 2017. “HiCRep: Assessing the Reproducibility of Hi-c Data Using a Stratum-Adjusted Correlation Coefficient.” Genome Research. Cold Spring Harbor Lab, gr–220640.